Skip to content

fix(dab): isolate per-query validator failures in verify_batch#19

Merged
kentwelcome merged 2 commits into
mainfrom
fix/dab-verify-batch-per-query-isolation
Jun 21, 2026
Merged

fix(dab): isolate per-query validator failures in verify_batch#19
kentwelcome merged 2 commits into
mainfrom
fix/dab-verify-batch-per-query-isolation

Conversation

@kentwelcome

Copy link
Copy Markdown
Contributor

Problem

In DAB batch mode, verify_batch.py runs validate_qN.py for each query inside one
emit_reward loop with no per-query error isolation. If a single validator raises at
call time, the whole script crashes — no reward.json is written — and the harness records
the trial as a RewardFileNotFoundError error, which drops the entire dataset from the
run
. It silently leaves the stratified-Pass@1 denominator rather than scoring 0.

Observed on PATENTS: the agent returned the q1 answer as a JSON list; validate_q1.py
calls llm_output.lower()AttributeError: 'list' object has no attribute 'lower'. All of
q1/q2/q3 lost their rewards and PATENTS vanished from the summary — inflating the reported
stratified Pass@1 (computed over the surviving 11 datasets instead of 12).

Fix

Wrap the per-query validate_fn(answer) call in try/except: score the offending query
0.0 with the exception text as the reason, and continue grading the remaining queries.
A malformed (e.g. non-string) answer is a content failure → reward 0, not a harness error.

Scope is deliberately narrow: the except only covers the runtime call. Validator
import failures (e.g. a missing common_scaffold dependency, which indicates a broken
verifier setup affecting every query) still crash loudly — preserved by the existing
test_batch_verify_does_not_mask_validator_import_errors.

Tests

  • New test_batch_verify_isolates_per_query_runtime_validator_error: a list-typed answer
    scores its query 0 (with reason), a sibling well-typed query still grades, reward.json is
    written (mean 0.5).
  • Existing import-error-stays-loud and happy-path tests unchanged and passing (3/3).

🤖 Generated with Claude Code

kentwelcome and others added 2 commits June 21, 2026 11:40
A validator raising (e.g. validate_qN calling .lower() on a non-string
answer when the agent returns a JSON list) aborted emit_reward for the
whole dataset, so no reward.json was written and the entire dataset was
dropped from the run as a RewardFileNotFoundError trial error — silently
removing it from the stratified Pass@1 denominator rather than scoring 0.

Wrap the per-query validate_fn call in try/except: score the offending
query 0.0 with the exception as the reason and continue grading the rest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 21, 2026 11:41

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the DAB batch verifier so that a single query’s validator crashing at runtime no longer aborts reward emission for the entire dataset, preventing dropped datasets and inflated aggregate metrics.

Changes:

  • Wrap per-query validate_fn(answer) calls in verify_batch.py with try/except to convert runtime validator exceptions into a 0.0 reward and an explanatory reason, while continuing to grade remaining queries.
  • Add a unit test ensuring a malformed answer for one query does not prevent sibling queries from being graded and does not prevent reward.json from being written.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
packages/razorback-plugin-dab/src/razorback_plugin_dab/verify/verify_batch.py Adds per-query runtime exception isolation around validator execution, recording a zero reward + reason instead of crashing.
packages/razorback-plugin-dab/tests/unit/test_verify_batch_reward_shape.py Adds coverage to assert validator runtime errors are isolated per query and artifacts are still written.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@kentwelcome kentwelcome merged commit d9fc8a8 into main Jun 21, 2026
1 check passed
@kentwelcome kentwelcome deleted the fix/dab-verify-batch-per-query-isolation branch June 21, 2026 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants